CELLAR: A Data Modeling System for Linguistic Annotation
Abstract
CELLAR is not a particular annotation schema, but is a system for expressing and building annotation schemas. The paper illustrates how an annotation schema is expressed as an XML document that defines classes of objects, their properties, and the relationships between objects. The schema is then implemented via automatic conversion to a relational database schema and an XML DTD for data import and export.

Requirements for Linguistic Data Modeling

CELLAR is a data modeling system that was built specifically for the purpose of linguistic annotation. It was designed to model the following five fundamental aspects of the nature of linguistic data:

1. The data in a text unfold sequentially; the data model must therefore be able to represent text in proper sequence.
2. The data are hierarchically structured; the data model must therefore be able to express hierarchical structures of arbitrary depth.
3. The data elements bear information in many simultaneous dimensions; the data model must therefore be able to annotate data objects with many simultaneous properties.
4. The data are highly interrelated; the data model must therefore be able to encode associative links between related pieces of data.
5. The data are multilingual; the data model must therefore be able to keep track of what language each datum is in.

Conceptual Modeling in CELLAR

CELLAR is not a particular annotation schema, but is a system for expressing and building annotation schemas. A particular annotation schema is called a conceptual model and is expressed as an XML document which defines classes of objects, their properties, and constraints on the values of properties and the relationships between objects. A simplified version of the DTD for expressing conceptual models is given on the next page.

Modeling begins by identifying the classes of things in the world being modeled (i.e., the objects of object-oriented modeling or the entities of entity-relationship modeling).
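The five requirements listed above can be made concrete with a small data structure. The following Python sketch is only an illustration under stated assumptions: the `Unit` class, its field names, and the sample Spanish data are invented here and are not CELLAR's own classes. It shows ordered parts (requirement 1), nesting to any depth (2), several simultaneous properties (3), associative links (4), and a language tag on every datum (5).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    """A hypothetical annotated unit; not a CELLAR class."""
    kind: str                     # e.g. "text", "sentence", "word", "entry"
    language: str                 # requirement 5: language of this datum
    form: str = ""
    # Requirement 3: any number of simultaneous properties.
    properties: Dict[str, str] = field(default_factory=dict)
    # Requirements 1 and 2: an ordered list of parts, nested to any depth.
    parts: List["Unit"] = field(default_factory=list)
    # Requirement 4: links to related units (references, not copies).
    links: List["Unit"] = field(default_factory=list)


def leaves_in_order(unit: Unit) -> List[str]:
    """Depth-first walk returning the forms of the leaf units in sequence."""
    if not unit.parts:
        return [unit.form]
    forms: List[str] = []
    for part in unit.parts:
        forms.extend(leaves_in_order(part))
    return forms


# Invented sample data: a one-sentence text whose noun links to a lexicon entry.
entry = Unit(kind="entry", language="es", form="perro")
sentence = Unit(kind="sentence", language="es", parts=[
    Unit("word", "es", "el", {"pos": "det"}),
    Unit("word", "es", "perro", {"pos": "noun", "gloss": "dog"}, links=[entry]),
    Unit("word", "es", "ladra", {"pos": "verb", "gloss": "barks"}),
])
text = Unit(kind="text", language="es", parts=[sentence])
```

Calling `leaves_in_order(text)` walks the hierarchy and yields the word forms in their textual order, while the noun's `links` list points at the shared lexicon entry rather than duplicating it.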
1 The acronym is for "Computing Environment for Linguistic, Literary, and Anthropological Research". (See http://www.sil.org/cellar/ for more information.) CELLAR has been a team effort over the years and I am thus indebted to a host of colleagues. The leader of the development team has been John Thomson. Others who have had a major role in designing the things described in this paper are Shon Katzenberger, Stephen McConnel, and Ken Zook.
2 "Linguistic annotation," by Steven Bird and Mark Liberman, Linguistic Data Consortium (2000). http://www.ldc.upenn.edu/annotation/.
3 "The nature of linguistic data and the requirements of a computing environment for linguistic research," by Gary F. Simons. In Using Computers in Linguistics: a practical guide, edited by John Lawler and Helen Aristar Dry, pages 10-25. London and New York: Routledge (1998). An earlier version is available at http://www.sil.org/computing/computing_environment.html.
4 XML is the "Extensible Markup Language"; see http://www.w3.org/TR/REC-xml for the specification and http://www.oasis-open.org/cover/ for a host of related resources.

A class definition consists of documentation and a set of property definitions (which implement the third requirement above). Each class has a base class from which it also inherits properties; the ultimate base class is CmObject, for "conceptually modeled object." There are three kinds of properties:

(1) Owning properties implement the part-whole relationships entailed by the second requirement that data are hierarchically structured. The cardinality attribute can specify that the owned objects form a sequence, thus supporting the first requirement.

(2) Relationship properties implement the fourth requirement, that the model must support associative links between related data objects. The signature attribute constrains what classes of objects can be the targets of a link.
(3) Basic properties store primitive data values like numbers, strings, Booleans, dates, and binary images (such as for graphics or sound). The fifth requirement that data are multilingual is supported by the fact that the primitive type String allows spans of characters to be identified as to language, and MultiString and MultiUnicode support alternate renditions of the same string in multiple languages.

As a conceptual model is developed in XML, descriptions of the classes and properties are included right inside the definitions. As a result, an XSL stylesheet is able to render the conceptual model source code as hyperlinked documentation in a browser.

An Overview of the Implementation

CELLAR was first implemented in Smalltalk (beginning in 1990) as a stand-alone object-oriented database management system. The data store and the applications to manipulate those data were both modeled within the object-oriented database. This system was used to build the text analysis and lexicon management tools in the product named LinguaLinks.

Now, ten years later, we are building a second-generation system based on a client-server model, which separates the implementation of the data store from that of the applications. This offers a number of advantages: (1) we are able to use an off-the-shelf product for the data server, (2) applications can be written in a variety of languages (including C++, Java, Visual Basic, and Dynamic HTML), (3) client and server can be on the same machine or can communicate across a network, and (4) the transaction processing capability of the data server opens up the possibility of safe multi-user write access to the same database.

A relational database engine is being used to implement the object store. Specifically, we are using Microsoft's SQL Server. In installations for use by individual linguists, we are able to use the Microsoft Data Engine (MSDE), which is a freely distributable 1-to-5-user version of SQL Server.
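To illustrate the class definitions and the three kinds of properties described under Conceptual Modeling in CELLAR, here is a minimal sketch of what a conceptual model in XML might look like and how a program could read it. The paper's actual DTD is not reproduced in this text, so the element names (`conceptualModel`, `class`, `doc`, `owning`, `relationship`, `basic`), the `name`, `base`, and `type` attributes, the cardinality values, the use of `signature` on owning properties, and the sample classes are all assumptions made for illustration. Only `cardinality`, `signature`, `CmObject`, and the type names String, MultiString, and MultiUnicode come from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: apart from "cardinality", "signature", CmObject, and
# the basic type names, every element and attribute name here is invented.
MODEL_XML = """
<conceptualModel>
  <class name="Text" base="CmObject">
    <doc>A text made up of an ordered series of sentences.</doc>
    <owning name="sentences" cardinality="sequence" signature="Sentence"/>
    <basic name="title" type="MultiString"/>
  </class>
  <class name="Sentence" base="CmObject">
    <doc>One sentence of a text.</doc>
    <owning name="words" cardinality="sequence" signature="Word"/>
  </class>
  <class name="Word" base="CmObject">
    <doc>One word token.</doc>
    <basic name="form" type="String"/>
    <relationship name="entry" cardinality="atomic" signature="LexEntry"/>
  </class>
  <class name="LexEntry" base="CmObject">
    <doc>A lexicon entry.</doc>
    <basic name="citationForm" type="MultiUnicode"/>
  </class>
</conceptualModel>
"""

PROPERTY_KINDS = ("owning", "relationship", "basic")


def load_model(xml_text):
    """Parse the sketch markup into {class name: {base, doc, properties}}."""
    root = ET.fromstring(xml_text)
    model = {}
    for cls in root.findall("class"):
        props = [
            {"kind": child.tag, **child.attrib}
            for child in cls
            if child.tag in PROPERTY_KINDS
        ]
        model[cls.get("name")] = {
            "base": cls.get("base"),
            "doc": (cls.findtext("doc") or "").strip(),
            "properties": props,
        }
    return model


def undefined_signatures(model):
    """List (class, property, target) triples whose target class is unknown."""
    known = set(model) | {"CmObject"}
    missing = []
    for cls_name, cls in model.items():
        for prop in cls["properties"]:
            if prop["kind"] != "basic" and prop.get("signature") not in known:
                missing.append((cls_name, prop["name"], prop.get("signature")))
    return missing


model = load_model(MODEL_XML)
```

Because the documentation lives inside each class definition, the same parsed structure could feed both a validity check such as `undefined_signatures(model)` and a documentation renderer, which is the role the paper assigns to its XSL stylesheet.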
Figure 1 gives an overview of the system implementation. The box represents the off-the-shelf database server. The ovals represent software that our project has implemented. The arrows represent data flows. The numbered items represent text files that serve as input or output. The dashed lines represent the relationship between a valid XML document (on the right) and the DTD to which it conforms (on the left).

Figure 1. Overview of the CELLAR system

Building a CELLAR application begins with the Conceptual Model Builder. This has three components: the Class Manager (which gives a graphical user interface for creating and editing the XML representation [2] of the class definitions in a conceptual model in such a way that they conform to the DTD for conceptual models [1]), the Code Generator (which generates the SQL Data Definition Language code [3] that will implement the conceptual model as a relational database schema), and the DTD Generator (which generates a DTD [4] that defines the set of XML documents that can be loaded directly into that conceptual model). The SQL code [3] is then executed by SQL Server to build the schema for the database.

Once the database schema has been created, the database can be populated by using the Data Import Client to read data from an XML document [5] that conforms to the DTD for this conceptual model [4], as generated by the Conceptual Model Builder. The Data Export Client does the reverse, reading data from the database and writing it out as XML that conforms to the same DTD. Multiple client application programs can be implemented in various languages to access and manipulate a given database; the client provides a user interface that is optimal for the task at hand, while the SQL data engine maintains data integrity (according to the constraints expressed in the conceptual model).

5 See http://www.sil.org/lingualinks/LingTool.html.
6 See http://msdn.microsoft.com/vstudio/msde/.
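The Code Generator step described above, which turns a conceptual model into SQL Data Definition Language, can be sketched as follows. This is not CELLAR's actual mapping, which the paper does not spell out. It assumes a hypothetical scheme in which each class becomes a table, basic properties become columns, relationship properties become foreign-key columns, and owning properties add `owner_id` and `ord` columns to the owned class's table. Python's built-in `sqlite3` stands in for SQL Server so that the example is self-contained, and the `MODEL` data and all table and column names are invented.

```python
import sqlite3

# Hypothetical conceptual model, already parsed into plain Python data:
# class name -> list of (property kind, property name, SQL type or target class).
MODEL = {
    "LexEntry": [("basic", "citation_form", "TEXT")],
    "Text": [("basic", "title", "TEXT"), ("owning", "sentences", "Sentence")],
    "Sentence": [("owning", "words", "Word")],
    "Word": [("basic", "form", "TEXT"), ("relationship", "entry", "LexEntry")],
}


def generate_ddl(model):
    """Return one CREATE TABLE statement per class (assumed mapping only).

    The sketch assumes each class is owned through at most one owning property.
    """
    columns = {name: ["id INTEGER PRIMARY KEY"] for name in model}
    for cls, props in model.items():
        for kind, prop, target in props:
            if kind == "basic":
                columns[cls].append(f"{prop} {target}")
            elif kind == "relationship":
                # Associative link: a nullable foreign key to the target class.
                columns[cls].append(f'{prop}_id INTEGER REFERENCES "{target}"(id)')
            elif kind == "owning":
                # Part-whole link: the owned row records its owner and position.
                columns[target].append(
                    f'owner_id INTEGER NOT NULL REFERENCES "{cls}"(id)')
                columns[target].append("ord INTEGER NOT NULL")
            else:
                raise ValueError(f"unknown property kind: {kind}")
    return [
        f'CREATE TABLE "{name}" ({", ".join(cols)});'
        for name, cols in columns.items()
    ]


def build_database(model):
    """Execute the generated DDL in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # let the engine enforce links
    for statement in generate_ddl(model):
        conn.execute(statement)
    return conn


conn = build_database(MODEL)
```

With foreign keys enabled, the engine itself rejects a sentence whose owner does not exist, which is one small instance of the division of labor the paper describes: clients handle the user interface while the data engine maintains integrity.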
Publication date: 2000